Browsing by Subject "Statistical Sciences"
- Item (Open Access): A comparative study of stochastic models in biology (1997). Brandão, Anabela de Gusmão; Zucchini, Walter; Underhill, Les. In many instances, problems that arise in biology do not fall under any category for which standard statistical techniques are available to analyse them. In these situations, specific methods have to be developed to answer the questions put forward by biologists. In this thesis four different problems occurring in biology are investigated. A stochastic model is built in each case to describe the problem at hand. These models are not only effective as description tools but also afford strategies, consistent with conventional model selection processes, for dealing with standard statistical hypothesis-testing situations. The abstracts of the papers resulting from these problems are presented below.
- Item (Open Access): A multivariate statistical approach to the assessment of nutrition status (1972). Fellingham, Stephen A; Troskie, Casper G. Attention is drawn to the confusion which surrounds the concept of nutrition status, and the problem of selecting an optimum subset of variables by which nutrition status can best be assessed is defined. Using a multidisciplinary data set of some 60 variables observed on 1898 school children from four racial groups, the study aims to identify statistically both those variables which are unrelated to nutrition status and those which, although related, are so highly correlated that measuring all of them would be an unnecessary extravagance. It is found that, while the somatometric variables provide a reasonably good (but non-specific) estimate of nutrition status, the disciplines form meaningful groups and the variables of the various disciplines tend to supplement rather than replicate each other. Certain variables from most of the disciplines are therefore necessary for an optimum and specific estimate of nutrition status. Both the potential and the shortcomings of a number of statistical techniques are demonstrated.
- Item (Open Access): Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech (2022). Houston, Charles; Britz, Stefan S; Durbach, Ian. Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to fine-tuning, layer freezing, data augmentation and re-initialization were also investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech.
While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model.
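The time- and frequency-masking augmentation mentioned in the abstract above can be sketched as follows. This is a minimal, SpecAugment-style illustration on a spectrogram array; the function name and mask widths are illustrative, not taken from the thesis:

```python
import numpy as np

def mask_spectrogram(spec, max_t=10, max_f=8, rng=None):
    """Apply one random frequency mask and one random time mask
    (SpecAugment-style) to a (freq_bins, time_frames) spectrogram."""
    rng = np.random.default_rng(rng)
    spec = spec.copy()
    n_freq, n_time = spec.shape
    # Frequency mask: zero out a random band of consecutive frequency bins.
    f = rng.integers(0, max_f + 1)
    f0 = rng.integers(0, n_freq - f + 1)
    spec[f0:f0 + f, :] = 0.0
    # Time mask: zero out a random span of consecutive time frames.
    t = rng.integers(0, max_t + 1)
    t0 = rng.integers(0, n_time - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec
```

Applying such masks during fine-tuning forces the acoustic model to rely on surrounding context rather than any single band or frame, which is why the technique acts as a regulariser on small datasets like UASpeech.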
- Item (Open Access): Agent-based model of the market penetration of a new product (2014). Magadla, Thandulwazi; Durbach, Ian; Scott, Leanne. This dissertation presents an agent-based model that is used to investigate the market penetration of a new product within a competitive market. The market consists of consumers who belong to a social network that serves as a substrate over which they exchange positive and negative word-of-mouth communication about the products that they use. Market dynamics are influenced by factors such as product quality; the level of satisfaction that consumers derive from using the products in the market; switching constraints that make it difficult for consumers to switch between products; the word-of-mouth that consumers exchange; and the structure of the social network that consumers belong to. Various scenarios are simulated in order to investigate the effect of these factors on the market penetration of a new product. The simulation results suggest that: ■ A new product reaches fewer new consumers and acquires a lower market share when consumers switch less frequently between products. ■ A new product reaches more new consumers and acquires a higher market share when it is of better quality than the existing products, because more positive word-of-mouth is disseminated about it. ■ When there are products that have switching constraints in the market, launching a new product with switching constraints results in a higher market share compared to launching it without switching constraints. However, it reaches fewer new consumers, because switching constraints result in negative word-of-mouth being disseminated about it, which deters other consumers from using it. Some factors, such as the fussiness of consumers; the shape and size of consumers' social networks; the type of messages that consumers transmit; and with whom and how often they communicate about a product, may be beyond the control of marketing managers.
However, these factors can potentially be influenced through a marketing strategy that encourages consumers to exchange positive word-of-mouth both with consumers that are familiar with a product and those who are not.
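The kind of agent-based dynamics described above can be illustrated with a deliberately simplified sketch. Everything here (network structure, parameter values, the satisfaction rule) is hypothetical and far coarser than the thesis model; it only shows the mechanism of word-of-mouth-driven switching on a network:

```python
import random

def simulate_market(n_consumers=500, steps=50, p_switch=0.1, seed=1):
    """Toy agent-based market: consumers on a random network exchange
    positive word-of-mouth; satisfaction (driven by product quality)
    determines which product a switching consumer adopts."""
    random.seed(seed)
    quality = {"old": 0.6, "new": 0.8}              # probability a use is satisfying
    choice = ["old"] * n_consumers                  # everyone starts on the incumbent
    friends = [random.sample(range(n_consumers), 5) for _ in range(n_consumers)]
    for _ in range(steps):
        for i in range(n_consumers):
            if random.random() > p_switch:          # switching constraint
                continue
            # Positive word-of-mouth: count friends satisfied with each product.
            votes = {p: 0 for p in quality}
            for j in friends[i]:
                if random.random() < quality[choice[j]]:
                    votes[choice[j]] += 1
            if max(votes.values()) > 0:
                choice[i] = max(votes, key=votes.get)
    return {p: choice.count(p) / n_consumers for p in quality}
```

Lowering `p_switch` in this sketch mimics the first simulation finding above: with infrequent switching, the higher-quality entrant spreads more slowly through the network.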
- Item (Open Access): The analysis of some bivariate astronomical time series (1993). Koen, Marthinus Christoffel; Zucchini, Walter. In the first part of the thesis, a linear time-domain transfer function is fitted to satellite observations of a variable galaxy, NGC5548. The transfer functions relate an input series (ultraviolet continuum flux) to an output series (emission line flux). The methodology for fitting transfer functions is briefly described. The autocorrelation structure of the observations of NGC5548 in different electromagnetic spectral bands is investigated, and appropriate univariate autoregressive moving average models are given. The results of extensive transfer function fitting, using respectively the λ1337 and λ1350 continuum variations as input series, are presented. There is little evidence for a dead time in the response of the emission line variations, which are presumed driven by the continuum. Part 2 of the thesis is devoted to the estimation of the lag between two irregularly spaced astronomical time series. Lag estimation methods which have been used in the astronomy literature are reviewed. Some problems are pointed out, particularly the influence of autocorrelation and non-stationarity of the series. If the two series can be modelled as random walks, both these problems can be dealt with efficiently. Maximum likelihood estimation of the random walk and measurement error variances, as well as the lag between the two series, is discussed. Large-sample properties of the estimators are derived. An efficient computational procedure for the likelihood, which exploits the sparseness of the covariance matrix, is briefly described. Results are derived for two example data sets: the variations in the two gravitationally lensed images of a quasar, and brightness changes of the active galaxy NGC3783 in two different wavelengths. The thesis is concluded with a brief consideration of other analysis methods which appear interesting.
- Item (Open Access): Applications of Machine Learning in Apple Crop Yield Prediction (2021). van den Heever, Deirdre; Britz, Stefan S. This study proposes the application of machine learning techniques to predict yield in the apple industry. Crop yield prediction is important because it impacts resource and capacity planning. It is, however, challenging because yield is affected by multiple interrelated factors such as climate conditions and orchard management practices. Machine learning methods have the ability to model complex relationships between input and output features. This study considers the following machine learning methods for apple yield prediction: multiple linear regression, artificial neural networks, random forests and gradient boosting. The models are trained, optimised, and evaluated using both a random and chronological data split, and the out-of-sample results are compared to find the best-suited model. The methodology is based on a literature analysis that aims to provide a holistic view of the field of study by including research in the following domains: smart farming, machine learning, apple crop management and crop yield prediction. The models are built using apple production data and environmental factors, with the modelled yield measured in metric tonnes per hectare. The results show that the random forest model is the best performing model overall with a Root Mean Square Error (RMSE) of 21.52 and 14.14 using the chronological and random data splits respectively. The final machine learning model outperforms simple estimator models showing that a data-driven approach using machine learning methods has the potential to benefit apple growers.
- Item (Open Access): Automated detection and classification of red roman in unconstrained underwater environments using Mask R-CNN (2021). Conrady, Christopher; Er, Sebnem; Attwood, Colin G. The availability of relatively cheap, high-resolution digital cameras has led to an exponential increase in the capture of natural environments and their inhabitants. Video-based surveys are particularly useful in the underwater domain, where observation by humans can be expensive, dangerous, inaccessible, or destructive to the natural environment. Moreover, video-based surveys offer an unedited record of biodiversity at a given point in time – one that is not reliant on human recall or susceptible to observer bias. In addition, secondary data that are useful in scientific study (date, time, location, etc.) are by default stored in almost all digital formats as metadata. When analysed effectively, this growing body of digital data offers the opportunity for robust and independently reproducible scientific study of marine biodiversity (and how this might change over time, for example). However, the manual review of image and video data by humans is slow, expensive, and not scalable. A large majority of marine data has never been analysed by human experts. This necessitates computer-based (or automated) methods of analysis that can be deployed at a fraction of the time and cost, at a comparable accuracy. Mask R-CNN, a deep learning object recognition framework, has outperformed all previous state-of-the-art results on competitive benchmarking tasks. Despite this success, Mask R-CNN and other state-of-the-art object recognition techniques have not been widely applied in the underwater domain, and not at all within the context of South Africa.
To address this gap in the literature, this thesis contributes (i) a novel image dataset of red roman (Chrysoblephus laticeps), a fish species endemic to Southern Africa, and (ii) a Mask R-CNN framework for the automated localisation, classification, counting, and tracking of red roman in unconstrained underwater environments. The model, trained on an 80:10:10 split, accurately detected and classified red roman on the training dataset (mAP50 = 80.29%), validation dataset (mAP50 = 80.35%), as well as on previously unseen footage (test dataset) (mAP50 = 81.45%). The fact that the model performs equally well on unseen footage suggests that it is capable of generalising to new streams of data not used in this research – this is critical for the utility of any statistical model outside of “laboratory conditions”. This research serves as a proof-of-concept that machine learning based methods of video analysis of marine data can replace or at least supplement human analysis.
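The mAP50 figures quoted above count a detection as correct when its predicted box overlaps a ground-truth box with intersection-over-union (IoU) of at least 0.5. A minimal IoU computation (assuming axis-aligned boxes given as corner coordinates `(x1, y1, x2, y2)`; this is a generic illustration, not code from the thesis):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if boxes do not overlap).
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Averaging the precision of detections that pass this 0.5 threshold over recall levels (and over classes) yields the mAP50 statistic reported for the training, validation and test splits.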
- Item (Open Access): Bayesian analysis of historical functional linear models with application to air pollution forecasting (2022). Junglee, Yovna; Erni, Birgit; Clark, Allan. Historical functional linear models are used to analyse the relationship between a functional response and a functional predictor whereby only the past of the predictor process can affect the current outcome. In this work, we develop a Bayesian framework for the analysis of the historical functional linear model with multiple predictors. Different from existing Bayesian approaches to historical functional linear models, our proposed methodology is able to handle multiple functional covariates with measurement error and sparseness. The proposed model utilises the well-established connection between non-parametric smoothing and Bayesian methods to reduce sensitivity to the number of basis functions which are used to model the functional regression coefficients. We investigate two methods of estimation within the Bayesian framework. We first propose to smooth the functional predictors independently from the regression model in a two-stage analysis, and secondly, jointly with the regression model. The efficiency of the MCMC algorithms is increased by implementing a Cholesky decomposition to sample from high-dimensional Gaussian distributions and by taking advantage of the orthogonal properties of the functional principal components used to model the functional covariates. Our extensive simulation study shows substantial improvements in both the recovery of the functional regression surface and the true underlying functional response with higher coverage probabilities, when compared to a classical model under which the measurement error is unaccounted for. We further found that the Bayesian two-stage analysis outperforms the joint model under certain conditions. A major challenge with the collection of environmental data is that they are prone to measurement error, both random and systematic.
Hence, our methodology provides a reliable functional data analytic framework for modelling environmental data. Our focus is on the application of our method to forecast the level of daily atmospheric pollutants using meteorological information such as hourly records of temperature, humidity and wind speed from data collected by the City of Cape Town, South Africa. The forecasts provided by the proposed Bayesian two-stage model are highly competitive against the functional autoregressive models which are traditionally used for functional time series.
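The historical constraint described in the abstract above (only the past of each predictor affecting the current outcome) can be written out explicitly. In the standard formulation with $p$ functional predictors (notation assumed here, not copied from the thesis):

```latex
Y_i(t) = \alpha(t)
  + \sum_{j=1}^{p} \int_{0}^{t} \beta_j(s, t)\, X_{ij}(s)\, \mathrm{d}s
  + \varepsilon_i(t),
\qquad \beta_j(s, t) = 0 \ \text{for } s > t,
```

so each regression surface $\beta_j(s,t)$ is supported only on the triangle $s \le t$: the response at time $t$ depends on predictor $j$ only through its history up to $t$.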
- Item (Open Access): Behavioural, microhabitat, and phylogenetic dimensions of intrasexual contest competition in combatant monkey beetles (Scarabaeidae: Hopliini) (2021). Rink, Ariella N; Altwegg, Res; Colville, Jonathan F; Bowie, Rauri C K. The importance of sexual selection as a driver of evolution, from microevolution to speciation, has overwhelmingly been studied in the context of female choice, but there is evidence that male-male competition can also drive evolution. Recent reviews of the intrasexual competition literature have developed several hypotheses of weapon divergence in both allopatry and sympatry and have suggested means by which weapon divergence may cause reproductive isolation and speciation, both alone and together with mate choice and ecological selection. Here, I assess the role of sexual selection, in the context of environmental variation at the level of the contest substrate and the developmental environment, in contributing to microevolution within the monkey beetles (Coleoptera: Scarabaeidae: Hopliini), a taxonomically and phenotypically diverse group of pollinating insects in the Greater Cape Floristic Region (GCFR) that shows a high degree of sexual dimorphism and mating behaviour driven by male-male competition. I build on previous observations of hind leg use in intrasexual male-male contests for reproductive access to females by showing that, in Heterochelus chiragricus, contests occur in the context of a significantly male-skewed sex ratio and consist of vigorous wrestling and pushing between two males on the flower heads occupied by embedded, feeding females, who apparently exert no mate choice. Contest outcomes are influenced by hind femur size and residency effects, and I apply hypotheses informed by evolutionary game theory to assess how males make decisions regarding persistence versus retreat.
I proceed to assess the evidence for the ‘divergent fighting contexts' hypothesis, which predicts weapon divergence driven by intrasexual contest competition in the context of variation in the contest substrate. I find that hind leg size in another combatant monkey beetle, the species complex Scelophysa trimeni, varies across gradients of flower size among several spatially distributed populations, suggesting that variation in flower size (the contest substrate) mediates selection for weapon morphologies that maximise performance under different fighting styles necessitated by differences in the contest substrate. I also find that male elytral colour varies both across gradients in the developmental environment and with variation in flower colour, suggesting that this trait may function as an honest signal of male fitness, but also that it may be under selection to maximise signal transmission against variable backgrounds of contest substrates. Finally, I quantify the extent to which integration, modularity, multivariate allometry, and phylogenetic effects influence the evolutionary lability of male monkey beetles' hind legs, and so mediate the pace of their evolutionary diversification in response to these varying contest substrates. My findings support a two-module pattern of modularity at both static and evolutionary levels, and I find that allometric scaling relationships are conserved within S. trimeni. These findings indicate that monkey beetle weapons are relatively unconstrained in their evolutionary diversification across divergent fighting substrates. I conclude by discussing these findings within the broader field of sexual selection and monkey beetle ecology and suggest directions for further work. The findings presented here support a role for sexual selection, interacting with variation in the flower contest substrate, as an important driver of the diversification of monkey beetles in the GCFR.
- Item (Open Access): Biplot graphical display techniques (1991). Iloni, Karen; Underhill, Leslie G. The thesis deals with graphical display techniques based on the singular value decomposition. These techniques, known as biplots, are used to find low-dimensional representations of multidimensional data matrices. The aim of the thesis is to provide a review of biplots for a practical statistician who is not familiar with the area. It therefore focuses on the underlying theory, assuming a standard statistician's knowledge of matrix algebra, and on the interpretation of the various plots. The topic falls in the realm of descriptive statistics. As such, the methods are chiefly exploratory. They are a means of summarising the data. The data matrix is represented in a reduced number of dimensions, usually two, for simplicity of display. The aim is to summarise the information in the matrix and to present a visual representation of this information. The rationale for using graphical display techniques is that the "gain in interpretability far exceeds the loss in information" (Greenacre, 1984). A graphical description is often easier to understand than a numerical one. Histograms and pie charts are familiar forms of data representation to many people with no other, or very rudimentary, statistical understanding. These are applicable to univariate data. For multivariate data sets, univariate methods do not reveal interesting relationships in the data set as a whole. In addition, a biplot can be presented in a manner which can be readily understood by non-statistically minded individuals. Greenacre (1984) comments that only in recent years has the value of statistical graphics been recognised. Young (1989) notes that there has recently been a shift in emphasis among statisticians towards exploratory data analysis methods. This school of thought was given momentum by the publication of the book "Exploratory Data Analysis" (Tukey, 1977).
The trend has been facilitated by advances in computer technology which have increased both the power and the accessibility of computers. Biplot techniques include the popular correspondence analysis. The original proponents of correspondence analysis (among them Benzecri) reject probabilistic modelling. At the other extreme, some view graphical display techniques as a mere preliminary to the more traditional statistical approaches. Under the latter view, graphical display techniques are used to suggest models and hypotheses. The emphasis in exploratory data techniques such as graphical displays is on 'getting a feel' for the data rather than on building models and testing hypotheses. These methods do not replace model building and hypothesis testing, but supplement them. The essence of the philosophy is that models are suggested by the data, rather than the frequently followed route of first fitting a model. Some work has gone into developing inferential methods, with hypothesis tests and associated p-values, for biplot-type techniques (Lebart et al., 1984; Greenacre, 1984). However, this aspect is not important if the techniques are viewed merely as exploratory. Chapter Two provides the mathematical concepts necessary for understanding biplots. Chapter Three explains exactly what a biplot is, and lays the theoretical framework for the biplot techniques that follow. The goal of this chapter is to provide a framework in which biplot techniques can be classified and described. Correlation biplots are described in Chapter Four. Chapter Five discusses the principal component biplot, and the link between these and principal component analysis is drawn. In Chapter Six, correspondence analysis is presented. In Chapter Seven, practical issues such as the choice of centre are discussed. Practical examples are presented in Chapter Eight. The aim is that these examples illustrate techniques commonly applicable in practice.
Evaluation and choice of biplot is discussed in Chapter Nine.
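The singular-value-decomposition construction underlying biplots can be sketched in a few lines. This is a generic illustration, not code from the thesis; the `alpha` parameter interpolates between the principal-component biplot (alpha = 1, rows in principal coordinates) and the covariance-style biplot (alpha = 0):

```python
import numpy as np

def biplot_coords(X, alpha=1.0, k=2):
    """Rank-k biplot coordinates from the SVD of a column-centred matrix.

    Returns G (row/observation markers) and H (column/variable markers)
    such that G @ H.T is the best rank-k approximation of the centred data.
    """
    Xc = X - X.mean(axis=0)                       # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    G = U[:, :k] * s[:k] ** alpha                 # row markers
    H = Vt[:k].T * s[:k] ** (1 - alpha)           # column markers
    return G, H

X = np.random.default_rng(0).normal(size=(30, 5))
G, H = biplot_coords(X)
```

Plotting the rows of `G` as points and the rows of `H` as arrows in the plane gives the familiar two-dimensional biplot: inner products between a point and an arrow approximate the corresponding entry of the centred data matrix.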
- Item (Open Access): Breeding production of Cape gannets Morus capensis at Malgas Island, 2002-03 (2006). Staverees, Linda; Underhill, Les; Crawford, RJM. Includes bibliographical references.
- Item (Open Access): Building a question answering system for the introduction to statistics course using supervised learning techniques (2020). Leonhardt, Waldo; Er, Sebnem; Scott, Leanne. Question Answering (QA) is the task of automatically generating an answer to a question asked by a human in natural language. Open-domain QA is still a difficult problem to solve even after 60 years of research in this field, as trying to answer questions which cover a wide range of subjects is a complex matter. Closed-domain QA is, on the other hand, more achievable, as the context for asking questions is restricted and allows for more accurate interpretation. This dissertation explores how a QA system could be built for the Introduction to Statistics course taught online at the University of Cape Town (UCT), for the purpose of answering administrative queries. This course runs twice a year and students tend to ask similar administrative questions each time that the course is run. If a QA system can successfully answer these questions automatically, it would save lecturers the time of having to do so manually, as well as enabling students to receive answers immediately. For a machine to be able to interpret natural language questions, methods are needed to transform text into numbers while still preserving the meaning of the text. The field of Natural Language Processing (NLP) offers the building blocks for such methods that have been used in this study. After predicting the category of a new question using Multinomial Logistic Regression (MLR), the past question that is most similar to the new question is retrieved and its answer is used for the new question. The following five classifiers were compared to see which one provides the best results for the categorisation of a new question: Naive Bayes, Logistic Regression, Support Vector Machines, Stochastic Gradient Descent and Random Forests. The cosine similarity method was used to find the most similar past question.
The Round-Trip Translation (RTT) technique was explored as an augmentation method for text, in an attempt to increase the dataset size. Methods were compared using the initial base dataset of 744 questions and the extended dataset of 6 614 questions, which was generated as a result of the RTT technique. In addition to these two datasets, features from Bag-of-Words (BoW), Term Frequency times Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDiA), pre-trained Global Vector (GloVe) word embeddings and custom-engineered features were also compared. This study found that a model using an MLR classifier with TF-IDF unigram and bigram features (built on the smaller 744-question dataset) performed the best, with a test F1-measure of 84.8%. Models using a Stochastic Gradient Descent classifier also performed very well with a variety of features, indicating that Stochastic Gradient Descent is the most versatile classifier to use. No significant improvements were found using the extended RTT dataset of 6 614 questions, but this dataset was used by the model that ranked eighth in position. A simulator was also built to illustrate and test how a bot (an autonomous program on a network that is able to interact with users) can be used to facilitate the auto-answering of student questions. This simulator proved very useful and helped to identify the fact that questions relating to the Course Information Pack had been excluded from the data that had been initially sourced, as students had been asking such questions through other platforms. Building a QA system using a small dataset proved to be very challenging. Restricting the domain of questions and focusing only on administrative queries was helpful. Extensive data cleaning was needed, and all past answers needed to be rewritten and standardised, as the raw answers were too specific and did not generalise well.
The features that performed the best for cosine similarity and for extracting the most similar past question were LSA topics built from TF-IDF unigram features. Using LSA topics as the input for cosine similarity, instead of the raw TF-IDF features, resolved the “curse of dimensionality”. Issues with cosine similarity were observed in cases where it favoured short documents, which often led to the selection of the wrong past question. As an alternative, the use of more advanced language-modelling-based similarity measures is suggested for future study. Either pre-trained word embeddings such as GloVe could be used as a language model, or a custom language model could be trained. A generic UCT language model could be valuable, and it would be preferable to build such a language model using the entire digital content of Vula across all faculties where students converse, ask questions or post comments. Building a QA system using this UCT language model is foreseen to offer better results, as terms like “Vula”, “DP”, “SciLab” and “jdlt1” would be endowed with more meaning.
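The retrieve-by-similarity step described above (TF-IDF vectors compared by cosine similarity) can be sketched as follows. The toy questions and the smoothed-IDF weighting are illustrative, not the thesis implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (as term->weight dicts) with smoothed IDF."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))           # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

past = [["when", "is", "the", "exam"],
        ["where", "do", "i", "upload", "my", "assignment"]]
new = ["when", "is", "my", "exam"]
vecs = tfidf_vectors(past + [new])
best = max(range(len(past)), key=lambda i: cosine(vecs[-1], vecs[i]))
# best indexes the most similar past question; its stored answer is reused.
```

Projecting such vectors onto LSA topics before applying cosine similarity, as the abstract suggests, reduces the dimensionality and mitigates the short-document bias noted above.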
- Item (Open Access): Business process modelling and simulation with application to a start-up actuarial firm (2015). Gweshe, Tatenda Mark; Scott, Leanne. In our research, we set out to model, understand and evaluate the business process at a start-up actuarial firm which employs Report Writers (RWers) who specialise in quantifying actuarial matters. We simulated various "what-if" and extreme scenarios relating to (1) the impact of qualitative variables (stress, morale and health) on RWer productivity, (2) hiring policies for RWers who have various skill sets, (3) the allocation of RWers to various roles within the process, (4) the impact that a high turnover of experienced RWers has on productivity, and (5) the impact of introducing a flexible working arrangement (flexitime). This was done through business process modelling and simulation. The business process we modelled was governed by numerous potentially complex inter-relationships between variables, which we believed could lead to potentially significant feedback loops. The models we built were then simulated over a period of 3 to 7 years to gain insights into the behavioural trends of the firm's business process over time when subject to "what-if" scenarios and policy implementations. The model simulations allowed us to gain an understanding of the behaviour of processes over time, and of the key variables and relationships involved in bringing about such behaviour as certain variables were subjected to changes in levels, as set out in our objectives. We made use of relevant literature, expert opinion, past data, questionnaires and cognitive mapping techniques to build the simulation models. Guided by methodologies used in the literature on modelling qualitative variables, and bearing in mind the dangers in modelling them, we modelled the complex inter-relationships between qualitative and quantitative variables.
- Item (Open Access): Calculation of calibration factors from the comparative fishing trial between FRS Africana and RV Dr Fridtjof Nansen (2008). Antony, Luyanda Lennox; Dunne, Tim; Leslie, Rob W. Includes abstract. Includes bibliographical references (leaves 153-157).
- Item (Open Access): Cape Town road traffic accident analysis: Utilising supervised learning techniques and discussing their effectiveness (2022). Du Toit, Christo; Er, Sebnem; Salau, Sulaiman. Road traffic accidents (RTAs) are a major cause of death and injury around the world and in South Africa. Methods to understand and reduce the frequency and injury severity of RTAs are of utmost importance. There is limited South African literature on modelling RTA injury-severity using supervised learning (SL) methods that fit a model relating a target variable to a set of predictor variables. In this thesis, multinomial logistic regression, classification trees (CT), random forests (RF), gradient boosted machines (GBM) and artificial neural networks (ANN) are used to model the potentially non-linear relationships between accident-related factors and injury-severity. Data on RTAs that occurred in the city of Cape Town during the period 2015-2017 are used for this study. The data contain the injury-severity of the RTAs as well as several accident-related variables. The injury-severity categories of RTAs are classified as: “no injury”, “slight”, “serious” and “fatal” injury. Additional locational and situational variables were added to the dataset. The exploratory analysis revealed that the vast majority of alleged causes (as deduced by the data capturers from the accident report) of RTAs are related to driver/human error; that accidents with pedestrians make up only 5.86% of all RTAs yet account for 58.56% of “fatal” accidents and 55.37% of “serious” accidents; and that the majority of “fatal” and “serious” RTAs occur on the weekend and involve only one vehicle. It was also identified that the RTA data were severely imbalanced with regard to injury-severity. Imbalanced data occur when the numbers of observations belonging to the classification categories are not approximately equal, which can negatively affect the performance of classification methods.
This thesis employed three common approaches to address class imbalance, namely (i) undersampling of the majority class, (ii) oversampling of the minority class and (iii) the synthetic minority oversampling technique (SMOTE). The RTA data were split into training, validation and test sets, keeping the proportions of the injury-severity categories consistent. Four training datasets were analysed: the original imbalanced data, data with the minority class oversampled, data with the majority class undersampled and data with synthetically created observations. The performance of the SL methods trained on these four datasets was compared using accuracy, recall, precision and F1 score as evaluation metrics. All three data sampling methods improved the CT, RF and GBM models' average recall and their ability to identify observations belonging to the minority class ("fatal" RTAs). With regard to maximising average recall, SMOTE was the most effective sampling method for addressing class imbalance. Further analysis was done to determine whether simple SL methods such as multinomial logistic regression are sufficient to model RTA injury-severity, or whether more complex SL methods such as ANNs are required. The ANN model achieved a higher average recall and correctly identified more observations belonging to the minority class ("fatal" RTAs) than the multinomial logistic regression model. Using average recall as the main evaluation metric, the ANN was selected as the best-performing model on the validation data. The ANN model correctly identified a large number of "fatal" RTAs, but also produced a high number of false positives. It was very effective at correctly identifying "no injury" RTAs, as evidenced by high recall and precision scores, but performed poorly at identifying "slight" and "serious" RTAs.
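As a hedged sketch (not the code used in the thesis), the two ideas above — interpolation-based minority oversampling in the spirit of SMOTE, and the macro-averaged "average recall" used as the main evaluation metric — might look as follows; the toy points and labels are hypothetical:

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate n_new synthetic minority points by interpolating between two
    randomly chosen minority samples (the core idea of SMOTE, without the
    k-nearest-neighbour step of the full algorithm)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

def macro_recall(y_true, y_pred):
    """Mean of the per-class recalls ("average recall"); unlike accuracy,
    it is not dominated by the majority class."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        preds = [p for t, p in zip(y_true, y_pred) if t == c]
        per_class.append(sum(p == c for p in preds) / len(preds))
    return sum(per_class) / len(per_class)

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(len(smote_like(minority, 5)))   # 5 synthetic points

y_true = ["no injury"] * 8 + ["fatal"] * 2
y_pred = ["no injury"] * 8 + ["no injury", "fatal"]  # misses one "fatal"
print(macro_recall(y_true, y_pred))   # 0.75, although plain accuracy is 0.9
```

The gap between 0.9 accuracy and 0.75 average recall in this toy example is exactly why average recall, rather than accuracy, is used to judge models on imbalanced injury-severity data.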
Finally, the variable importances of the CT, RF and GBM models trained on the SMOTE data revealed the geographical location of an RTA, the crash type and the number of vehicles involved to be significant risk factors associated with RTA injury-severity. The CT and RF models both identified the alleged cause of an accident as significant, while the RF and GBM models identified several weather-related variables as significant risk factors associated with RTA injury-severity. Future road safety policies should focus on reducing human/driver error, reducing pedestrian-related RTAs and increasing policing efforts over weekends and during poor weather conditions. Road safety policies should also take the geographical location of RTAs into account in order to identify high-risk areas for "serious" and "fatal" RTAs.
- ItemOpen AccessChanges in rainfall seasonality in the Western Cape, South Africa: an exploration of methods for determining the start and end of the rainfall season(2020) Ivey, Peter; Erni, BirgitThe aim of this thesis is to detect and analyse changes in rainfall seasonality for various groups of weather stations in the Western Cape. Weather stations with similar seasonal patterns are first grouped together using clustering algorithms. The start and end dates of the rainfall season for the different groups of weather stations are then estimated and compared over time to determine whether there have been any changes. Once these start and end dates have been estimated, the length of the rainfall season is estimated and compared over time. Studies attempting to analyse rainfall patterns and changes have been performed globally and over southern Africa. However, rainfall is the most variable climate element in time and space, and is therefore difficult to predict (Yaman, 2018). Most studies have pointed toward an increase in extreme events at both ends of the scale, i.e. more intense flooding and more severe drought. Some places are beginning to experience more rainfall than before, while others are experiencing more drought. The impacts of these rainfall changes are already being felt, with many areas forced to adapt to the new conditions. A better understanding of how rainfall seasons are changing supports better decision-making. In the agricultural industry, better-informed decisions about when the rainfall season is likely to start and end can result in more optimal crop yields. Changes in rainfall can also affect the type of crops that should be planted, and farmers will be better able to prepare for drought if they are better informed as to when dry periods are likely to occur.
In terms of disaster risk management, the more that is known about rainfall patterns, the better prepared regions can be for an increase in extreme events. Cities can put better systems in place now in order to deal with potential future crises. Cape Town is an example of a city that might have been better prepared for the recent drought crisis had there been a better understanding of rainfall trends. With more accurate information about rainfall, adaptation to changing climate conditions can become a proactive rather than a reactive process.
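One simple way to make the "start" and "end" of a rainfall season operational — offered here purely as an illustrative sketch, not necessarily one of the methods explored in the thesis — is a cumulative-fraction definition: the season starts (ends) on the day cumulative rainfall first reaches a low (high) fraction of the annual total.

```python
def season_bounds(daily_rain, lo=0.1, hi=0.9):
    """Start and end of the rainfall season, defined as the days on which
    cumulative rainfall first reaches the fractions `lo` and `hi` of the
    annual total. Returns (start_day, end_day, season_length_in_days)."""
    total = sum(daily_rain)
    cum = 0.0
    start = end = None
    for day, rain in enumerate(daily_rain):
        cum += rain
        if start is None and cum >= lo * total:
            start = day
        if end is None and cum >= hi * total:
            end = day
    return start, end, end - start

# A stylised winter-rainfall year: dry, then 100 wet days, then dry again.
daily = [0.0] * 100 + [10.0] * 100 + [0.0] * 165
print(season_bounds(daily))  # (109, 189, 80)
```

The fractions `lo` and `hi` are arbitrary choices here; comparing such estimated start, end and length values across years and station clusters is the kind of analysis the thesis performs.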
- ItemOpen AccessCold winters vs long journeys : adaptations of primary moult and body mass to migration and wintering in the Grey Plover Pluvialis squatarola(2002) Serra, Lorenzo; Underhill, LesThe Grey Plover Pluvialis squatarola is a circumpolar breeding wader with a cosmopolitan winter distribution. Primary moult generally starts only when potential wintering sites are reached. Across the Palearctic-African region Grey Plovers experience an enormous variety of ecological and climatic conditions, which determine the development of different moult patterns according to local conditions and the timing of migration.
- ItemOpen AccessComparison of ridge and other shrinkage estimation techniques(2006) Vumbukani, Bokang C; Thiart, ChristienShrinkage estimation is an increasingly popular class of biased parameter estimation techniques, vital when the columns of the matrix of independent variables X exhibit dependencies or near dependencies. These dependencies often lead to serious problems in least squares estimation: inflated variances and mean squared errors of the estimates, unstable coefficients, imprecision and improper estimation. Shrinkage methods allow a little bias and in exchange achieve smaller mean squared errors and variances for the biased estimators than those of the unbiased estimators. However, shrinkage methods depend on a shrinkage factor, whose estimation depends on unknown values that are often computed from the OLS solution. We argue that the instability of the OLS estimates may adversely affect the performance of shrinkage estimators. Hence a new method for estimating the shrinkage factors is proposed and applied to ridge and generalized ridge regression. We propose that the new shrinkage factors be based on the principal components instead of the unstable OLS estimates.
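The ordinary ridge estimator underlying this comparison can be sketched in a few lines. This is a generic illustration, not the thesis's proposed method; the near-dependent design matrix below is simulated purely to show the shrinkage effect.

```python
import numpy as np

# Ridge estimator: beta_hat(k) = (X'X + kI)^{-1} X'y.
# k = 0 recovers OLS; k > 0 trades a little bias for smaller variance
# when the columns of X are nearly dependent.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 1] + 0.01 * rng.normal(size=n)   # near-dependent columns
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, k):
    """Solve the ridge normal equations for shrinkage parameter k."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

ols = ridge(X, y, 0.0)       # unstable: columns 1 and 2 nearly collinear
shrunk = ridge(X, y, 5.0)    # stabilised by the penalty
print(np.linalg.norm(shrunk) < np.linalg.norm(ols))  # True: ridge shrinks
```

The norm of the ridge coefficient vector decreases monotonically in k, which is the "shrinkage" the abstract refers to; the thesis's contribution concerns how k (the shrinkage factor) is estimated.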
- ItemOpen AccessConstrained portfolio selection with Markov and non-Markov processes and insiders(2007) Durrell, Fernando; Ouwehand, Peter; Abraham, HaimWord processed copy. Includes bibliographical references (p. 158-168).
- ItemOpen AccessThe construction of a partial least squares biplot(2014) Oyedele, Opeoluwa Funmilayo; Lubbe, SugnetIn multivariate analysis, data matrices are often very large, which can make it difficult to describe their structure and to inspect visually the relationships between their rows (samples) and columns (variables). For this reason, biplots, the joint graphical display of the rows and columns of a data matrix, are useful tools for analysis. Since they were first introduced, biplots have been employed in a number of multivariate methods, such as Correspondence Analysis (CA), Principal Component Analysis (PCA), Canonical Variate Analysis (CVA) and Discriminant Analysis (DA), as a form of graphical display of data. Another possible application is in Partial Least Squares (PLS). First introduced as a regression method, PLS is more flexible than multivariate regression, and better suited than Principal Component Regression (PCR) to the prediction of a set of response variables from a large set of predictor variables. Employing the biplot in PLS gave rise to the PLS biplot, a new addition to the biplot family. In the current study, this biplot was successfully applied to sensory data to investigate the relationships between the sensory panel characteristics and the chemical quality measurements of sixteen olive oils. It was also applied to a large set of mineral sorting production data to investigate the relationships between the output variables and the process factors used to produce a final product. Furthermore, the PLS biplot was applied to Binomial-distributed data concerning the diabetes testing of Indian women and to Poisson-distributed data on the diversity of arboreal marsupials (possums) in a Montane ash forest.
After these applications, the PLS biplot is proposed as a useful graphical tool for displaying results from the (univariate) Partial Least Squares-Generalized Linear Model (PLS-GLM) analysis of a data set. With Partial Least Squares Regression (PLSR) being a valuable method for modelling high-dimensional data, especially in chemometrics, the PLS biplot was also successfully applied to a cereal evaluation containing one hundred and forty-five infrared spectra and six chemical properties, and to a gene expression data set with two thousand genes.
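A minimal one-component PLS extraction (in the spirit of the NIPALS algorithm) shows the scores and loadings that a PLS biplot displays jointly. This is an illustrative sketch on simulated data, not the construction developed in the thesis.

```python
import numpy as np

# One PLS component for a single response y: the weight vector w points in
# the direction of maximum covariance between X and y; scores t place the
# samples and loadings p place the variables in the biplot.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                       # 20 samples, 4 predictors
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + 0.1 * rng.normal(size=20)

Xc = X - X.mean(axis=0)                            # centre predictors
yc = y - y.mean()                                  # centre response

w = Xc.T @ yc                                      # covariance direction
w /= np.linalg.norm(w)                             # unit weight vector
t = Xc @ w                                         # X scores (biplot sample points)
p_load = Xc.T @ t / (t @ t)                        # X loadings (biplot variable axes)
q = yc @ t / (t @ t)                               # y loading for prediction

print(t.shape, p_load.shape)                       # (20,) (4,)
```

Plotting the rows of `t` (samples) and the rows of `p_load` (variables, drawn as axes) on the same pair of component axes is the essence of the joint display; further components are extracted from the deflated matrix `Xc - np.outer(t, p_load)`.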